This report explores a dataset containing quality ratings and chemical properties for 1,599 red wines (all from the red variant of the Portuguese Vinho Verde wine). Each quality rating is the median of the scores given by at least 3 wine experts, who rated each wine between 0 (very bad) and 10 (very excellent).
The dataset contains 12 variables - 11 numerical input variables based on physicochemical tests, and 1 categorical output variable (quality) based on sensory data.
The dataset will be explored using a single variable at a time. The goal is to find out which property (or properties) affects the quality of the wine. The analysis will start with the quality variable, followed by an analysis of each of the input variables.
Wine Quality:
Based on a visual inspection, quality values of 5 and 6 are the most common:
We will confirm that the visual inspection is correct:
## [1] 82.48906
82.49% (rounded to two decimal places) of the wines have a quality value of 5 or 6.
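A percentage like the one above can be computed as the mean of a logical vector. The following is a minimal sketch in R, assuming the data sit in a data frame with an integer `quality` column. The `wine_toy` data frame is a made-up stand-in, not the real dataset, so its result does not match 82.49%.

```r
# Toy stand-in for the wine data (hypothetical values, not the real dataset).
wine_toy <- data.frame(quality = c(3L, 5L, 5L, 6L, 6L, 6L, 7L, 5L, 8L, 6L))

# quality %in% c(5, 6) is TRUE for wines rated 5 or 6; the mean of a
# logical vector is the share of TRUE values.
pct_5_or_6 <- mean(wine_toy$quality %in% c(5, 6)) * 100
print(pct_5_or_6)
```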
Fixed Acidity:
The graph peaks at approximately 7.5:
Calculate the proportion of fixed acidity that lies in the 7 to 8 range:
## [1] 34.14634
Approximately 34.15% (rounded to two decimal places) of the wines have a fixed acidity in the 7 to 8 range.
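Range proportions of this kind can be wrapped in a small helper. This is a sketch only: `pct_in_range` is a hypothetical name, the sample values are invented, and whether the original calculation included both endpoints of the range is an assumption.

```r
# Percentage of values in the closed interval [lower, upper].
# Including both endpoints is an assumption about the original calculation.
pct_in_range <- function(x, lower, upper) {
  mean(x >= lower & x <= upper, na.rm = TRUE) * 100
}

# Hypothetical fixed acidity values, for illustration only.
fixed_acidity_toy <- c(6.8, 7.0, 7.4, 7.8, 8.0, 8.3, 9.1, 11.2)
print(pct_in_range(fixed_acidity_toy, 7, 8))
```

The same helper would cover the density and pH range calculations later in the report.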
Volatile Acidity:
Very few wines have a volatile acidity of more than 1:
These are likely outliers. After removing these outliers we see:
Most wines have a volatile acidity between 0.3 and 0.7. The distribution is approximately normal, with several local peaks.
Citric Acid:
Peaks appear at 0 and 0.48:
By far the biggest peak is value 0. Calculate how many wines have a value of 0:
## [1] 132
132 wines have a citric acid value of 0. Calculate what proportion of wines have a value below 0.5:
## [1] 78.61163
78.61% (rounded to two decimal places) of the wines have citric acid values below 0.5 (including 132 wines with a value of 0 as calculated earlier).
Residual Sugar:
Outliers are observed:
After removing the outliers we see:
The values peak at approximately 2 for residual sugar.
Chlorides:
Outliers are observed:
Removing the outliers we see:
Chloride values peak at approximately 0.075.
Free Sulfur Dioxide:
Outliers are observed:
Removing the outliers shows:
Free sulfur dioxide peaks at approximately 6.
Total Sulfur Dioxide:
Outliers are observed:
Removing the outliers shows:
The total sulfur dioxide peaks between 15 and 25.
Density:
Density has minimal variability:
Calculate the proportion of wines with density between 0.9945 and 0.9985:
## [1] 74.60913
74.61% (rounded to two decimal places) of the wines have a density between 0.9945 and 0.9985.
pH:
Outliers are observed:
Without outliers we see:
Calculate whether the majority of wines lie within the pH range of 3.2 to 3.4:
## [1] 53.97123
53.97% (rounded to two decimal places) of the wines have a pH between approximately 3.2 and 3.4, which is a slim majority.
Sulphates:
Outliers are observed:
Once removed we see:
Sulphate values peak at approximately 0.6.
Alcohol:
Outliers are observed:
Once removed we see:
There is a peak at approximately 9.5.
What is the structure of your dataset?
The red wine dataset contains 1,599 observations and 12 variables. 11 of them are input variables based on physicochemical tests. The remaining variable, quality, is the output variable based on sensory data. Each observation corresponds to one particular wine.
Here is the dataset structure:
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
What is/are the main feature(s) of interest in your dataset?
The main feature of interest is quality: 82.49% of the wines have a quality value of 5 or 6. I want to find out which of the input variables are associated with a high quality wine. Possible variables that could influence the quality of wines are:
What other features in the dataset do you think will help support your investigation into your feature(s) of interest?
The 4 input variables listed previously could contribute to wine quality. The bivariate analysis will look at how each of these attributes is distributed within each quality value.
Did you create any new variables from existing variables in the dataset?
I did not create any new variables.
Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?
Once the outliers were removed from the chlorides and residual sugar graphs, what had appeared to be heavily right-skewed distributions (long tails toward high values) turned out to be approximately normal. Limits were applied to the x-axis to narrow the plotted range, which made the bulk of the data easier to analyze.
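One way to choose such axis limits is to cut the view at an upper quantile. The sketch below uses base R and invented values; the 90th percentile cutoff is an assumption, since the report does not state which limits were actually used.

```r
# Hypothetical residual sugar values with a long right tail (illustration only).
residual_sugar_toy <- c(1.2, 1.6, 1.8, 1.9, 1.9, 2.0, 2.1, 2.3, 2.6, 6.1, 13.9)

# Assumed rule: show the axis from the minimum up to the 90th percentile.
x_limits <- c(min(residual_sugar_toy),
              quantile(residual_sugar_toy, probs = 0.90, names = FALSE))

# Values that fall inside the limits; the original vector is left unchanged.
in_view <- residual_sugar_toy[residual_sugar_toy <= x_limits[2]]

# Base R histogram restricted to the chosen limits.
hist(in_view, xlim = x_limits,
     main = "Residual sugar (trimmed view)", xlab = "residual sugar")
```

Subsetting only for the plot, rather than deleting rows, keeps the outlying wines available for the later correlation and box plot steps.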
The univariate plots did not indicate which variables influenced wine quality. A new strategy was implemented - a correlation graph of the variables was plotted to determine which variables should be plotted against wine quality:
We see that volatile acidity (-0.39), citric acid (0.23), sulphates (0.25), and alcohol (0.48) have the strongest correlations with wine quality. Each of these variables will be shown in a box plot, with the variable on the y-axis and wine quality on the x-axis.
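Correlations with quality can be read from one column of the correlation matrix. This is a minimal sketch using base R's `cor` (Pearson by default) on invented data; the column names follow the dataset structure shown earlier, but the values and the resulting coefficients are illustrative only.

```r
# Toy data frame (hypothetical values, not the real wine data).
wine_toy <- data.frame(
  alcohol          = c(9.4, 9.8, 10.0, 10.5, 11.2, 12.0, 12.8, 9.5),
  volatile.acidity = c(0.70, 0.88, 0.65, 0.50, 0.40, 0.35, 0.30, 0.76),
  quality          = c(5, 5, 5, 6, 6, 7, 8, 5)
)

# Pearson correlation of every column with quality, dropping quality itself.
cor_matrix <- cor(wine_toy)
cor_with_quality <- cor_matrix[rownames(cor_matrix) != "quality", "quality"]

# Sort by absolute strength so the strongest relationships come first.
print(cor_with_quality[order(abs(cor_with_quality), decreasing = TRUE)])
```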
Volatile Acidity vs. Wine Quality:
Low volatile acidity is a sign of a good quality wine: volatile acidity is negatively correlated with wine quality.
Citric Acid vs Wine Quality:
Good quality wines tend to have high levels of citric acid - citric acid is positively correlated with wine quality.
Sulphates vs. Wine Quality:
Good quality wines tend to have high values of sulphates - sulphates are positively correlated with wine quality.
Alcohol vs. Wine Quality:
Good quality wines have high levels of alcohol.
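The trend that a box plot shows can also be checked numerically by taking the median of a variable within each quality level. The sketch below uses invented values and base R's `tapply` and `boxplot`; it illustrates the mechanics and does not reproduce the report's figures.

```r
# Toy data (hypothetical values, not the real wine data).
wine_toy <- data.frame(
  quality = c(5, 5, 5, 6, 6, 6, 7, 7, 7),
  alcohol = c(9.4, 9.5, 9.8, 10.0, 10.5, 10.9, 11.2, 11.8, 12.5)
)

# Median alcohol within each quality level: the centre line of each box.
median_by_quality <- tapply(wine_toy$alcohol, wine_toy$quality, median)
print(median_by_quality)

# The corresponding box plot, with quality on the x-axis.
boxplot(alcohol ~ quality, data = wine_toy,
        xlab = "quality", ylab = "alcohol")
```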
Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?
The correlation plot brought into focus the variables (volatile acidity, citric acid, sulphates, and alcohol) that are most correlated with wine quality. It showed that other variables (such as chlorides or density) have little correlation with quality.
The features of interest changed after this analysis: fixed acidity was shown to have little correlation with quality.
Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?
An interesting relationship that was observed was that fixed acidity is correlated with volatile acidity, citric acid, density, and pH value. This will be investigated further to see what the relationship is (if any) between this correlation and wine quality.
What was the strongest relationship you found?
Alcohol has the strongest correlation with wine quality (0.48). The bivariate analysis shows that good quality wines tend to have high levels of alcohol.
The correlation graph shows that fixed acidity is correlated with volatile acidity (-0.26), citric acid (0.67), density (0.67), and pH (-0.68). Scatter plots of fixed acidity against each of these variables will be used to try to understand how they relate to wine quality.
The roughly linear trends in the scatter plots reflect the correlation between the variables on the x and y axes. However, no relationship with wine quality is apparent.
The continuous variables need to be converted into different ranges. Each variable will be sliced into five partitions. These partitions will be plotted with variables that have correlated with wine quality.
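Slicing a continuous variable into five partitions can be done with base R's `cut`. The sketch below assumes quintile breaks, so that each partition holds a similar number of wines; the report does not say whether equal-count or equal-width bins were used, and the values are invented.

```r
# Hypothetical alcohol values, for illustration only.
alcohol_toy <- c(9.0, 9.2, 9.4, 9.5, 9.8, 10.0, 10.5, 11.0, 11.8, 12.5)

# Assumed scheme: quintile breaks (0%, 20%, ..., 100%).
breaks <- quantile(alcohol_toy, probs = seq(0, 1, by = 0.2))

# include.lowest = TRUE keeps the minimum value in the first partition.
alcohol_bin <- cut(alcohol_toy, breaks = breaks, include.lowest = TRUE)
print(table(alcohol_bin))
```

With heavily tied data, such as the many citric acid values of 0, quantile breaks can coincide and `cut` will then reject them, so equal-width bins (`cut(x, breaks = 5)`) may be the more practical choice for some variables.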
Alcohol and Wine Quality with other properties:
These plots show that high sulphates combined with high alcohol are associated with higher wine quality, as are high citric acid combined with high alcohol.
Volatile Acidity and Wine Quality with other properties:
These plots show that lower values of volatile acidity influence wine quality.
Citric Acid and Wine Quality with other properties:
These plots show that citric acid does not influence wine quality.
Sulphates and Wine Quality with other properties:
These plots show that high values of sulphates influence wine quality.
Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?
High alcohol is associated with good wine quality. Higher levels of sulphates or citric acid, in combination with high alcohol, are associated with even better quality. Wine quality is also higher at lower values of volatile acidity.
Were there any interesting or surprising interactions between features?
Citric acid was shown to influence wine quality based on observing the correlation plot. However, from the multivariate plots we see that citric acid does not influence wine quality on its own - it must also be paired with high alcohol values to influence (positively) wine quality.
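An interaction like this can be checked by tabulating mean quality across two grouped variables at once. In the sketch below the toy values are deliberately constructed so that citric acid matters only when alcohol is high; it shows the mechanics of the check and is not evidence from the real data. The median split into "low" and "high" is an assumed simplification of the five-way partitioning.

```r
# Toy data constructed to show an interaction (hypothetical values).
wine_toy <- data.frame(
  alcohol     = c(9.2, 9.4, 9.5, 9.6, 11.5, 11.8, 12.0, 12.4),
  citric.acid = c(0.05, 0.45, 0.10, 0.50, 0.08, 0.48, 0.12, 0.52),
  quality     = c(5, 5, 5, 5, 6, 7, 6, 7)
)

# Split each variable at its median (an assumed, simplified two-way split).
alcohol_level <- ifelse(wine_toy$alcohol > median(wine_toy$alcohol),
                        "high", "low")
citric_level  <- ifelse(wine_toy$citric.acid > median(wine_toy$citric.acid),
                        "high", "low")

# Mean quality in each alcohol-by-citric-acid cell.
mean_quality <- tapply(wine_toy$quality,
                       list(alcohol = alcohol_level, citric.acid = citric_level),
                       mean)
print(mean_quality)
```

If citric acid had an effect of its own, the "low" alcohol row would also differ between its two columns.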
OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.
I did not create any models with the dataset.
82.49% of wines in the dataset have a quality value of 5 or 6. This graph is important because it shows how concentrated the ratings are, which frames the search for the metrics associated with better wine quality.
Alcohol and volatile acidity have the strongest correlations with the quality of the wine (0.48 and -0.39 respectively). We can observe that higher alcohol levels are associated with good quality wine.
Alcohol and sulphates are both positively correlated with the quality of the wine. Higher values of sulphates and alcohol together are associated with good quality wine.
The aim of this analysis was to find out which chemical properties influence the quality of red wines. The dataset used had 1,599 observations and 12 variables.
The Univariate Analysis showed that 82.49% of the wines have a quality value of 5 or 6. The other histograms in this section did not provide much help in deciding what affected wine quality.
The Bivariate Analysis box plots of variables with respect to wine quality showed that 4 variables - volatile acidity, alcohol, citric acid, and sulphates - are associated with wine quality.
The Multivariate Analysis scatter plots of variables correlated with wine quality resulted in useful information. Good quality wine is produced with:
Future Analysis:
The dataset is not ideal for this question. 82.49% of the wines have a quality value of 5 or 6, so very good and very bad wines are underrepresented. A dataset in which the quality variable was much closer to a uniform distribution would be preferable. A much larger number of observations would also be preferable - tens of thousands or more, rather than the 1,599 found in the current dataset.